Abstract

Input-sensitive profiling is a recent performance analysis 
technique that makes it possible to estimate the empirical 
cost function of individual routines of a program, helping de- 
velopers understand how performance scales to larger inputs 
and pinpoint asymptotic bottlenecks in the code. A current
limitation of input-sensitive profilers is that they specifically 
target sequential computations, ignoring any communication 
between threads. In this paper we show how to overcome this 
limitation, extending the range of applicability of the origi- 
nal approach to multithreaded applications and to applica- 
tions that operate on I/O streams. We develop new metrics 
for automatically estimating the size of the input given to 
each routine activation, addressing input produced by non- 
deterministic memory stores performed by other threads as 
well as by the OS kernel (e.g., in response to I/O or net- 
work operations). We provide real case studies, showing 
that our extension makes it possible to characterize the behavior
of complex applications more precisely than previous ap- 
proaches. An extensive experimental investigation on a vari- 
ety of benchmark suites (including the SPEC OMP2012 and 
the PARSEC benchmarks) shows that our Valgrind-based 
input-sensitive profiler incurs an overhead comparable to 
other prominent heavyweight analysis tools, while collecting
significantly more performance points from each profiling
session and correctly characterizing both thread-induced
and external input.



Categories and Subject Descriptors C.4 [Performance of
Systems]: Measurement Techniques; D.2.8 [Software
Engineering]: Metrics - performance measures

General Terms Algorithms, Measurement, Performance. 

Keywords Asymptotic analysis, dynamic program anal- 
ysis, instrumentation, I/O streams, multithreading, perfor- 
mance profiling, Valgrind, workload characterization. 

1. Introduction 

Performance profilers collect information on running
applications and associate performance metrics with software
locations such as routines, basic blocks, or calling contexts.
They play a crucial role in software comprehension and tuning,
letting developers identify hot spots and guide optimizations
toward the portions of code that are responsible for excessive
resource consumption.

Unfortunately, by reporting only the overall cost of por- 
tions of code, traditional profilers do not help programmers 
to predict how the performance of a program scales to larger 
inputs. To overcome this limitation, some recent works have 
addressed the problem of designing and implementing
performance profilers that return, instead of a single number
representing the cost of a portion of code, a cost function
that relates the cost to the input size (see, e.g., [5, 8, 31]).
This approach is inspired by traditional asymptotic analy- 
sis of algorithms, and makes it possible to analyze - and 
sometimes predict - the behavior of actual software imple- 
mentations run on deployed systems and realistic workloads. 
Some of the proposed methods, such as [8], perform multi- 
ple runs with different and determinable input parameters, 
measure their cost, and fit the empirical observations to a 
model that predicts performance as a function of workload 
size. More recent approaches go a step further, tackling
the problem of automatically measuring the size of the input
given to generic routines [5, 31], collecting data from multiple
or even single program runs.

As observed in [5] and [31], a current limitation of input-
sensitive profilers is that they specifically target sequen- 
tial computations, ignoring any communication between 
threads. Multithreaded applications based on concurrent pro- 
gramming are traditionally difficult to analyze, since threads 
can interleave in a nondeterministic fashion and affect the 
behavior of other threads. Nevertheless, they are widespread 
in modern multicore architectures, making the quest for dy- 
namic analysis tools for concurrent computations extremely 
critical. 

Our contribution. In this paper we show how to extend 
the input-sensitive profiling methodology to the full range 
of concurrent applications, building upon the approach
described in [5]. The ability to automatically infer the size
of the input data on which each routine activation oper- 
ates is a crucial issue in input-sensitive profiling, but cur- 
rent techniques may fail to properly characterize the input 
size in a multi-threaded environment. As shown in this pa- 
per, if the input size is not estimated correctly, the analy- 
sis of profiling data can lead to uninformative cost plots or 
even to misleading results. As a first contribution we there- 
fore propose a novel metric, called threaded read memory 
size, that overcomes this limitation, addressing input pro- 
duced by non-deterministic memory stores performed by 
other threads and by the OS kernel (e.g., in response to 
I/O or network operations). We provide real case studies, 
based on the MySQL database management system and on 
the vips image processing tool, showing that our extension
makes it possible to characterize the behavior of complex
applications more precisely than previous approaches. We then
demonstrate that the input size of a routine can be auto- 
matically and efficiently computed in a multithreaded set- 
ting, and discuss the implementation of a Valgrind-based 
input-sensitive profiler for concurrent applications (the tool 
is available at http://code.google.com/p/aprof/). An
extensive experimental investigation on a variety of benchmark
suites (including the SPEC OMP2012 and the PARSEC
benchmarks) shows that our profiler incurs an overhead
comparable to other prominent Valgrind tools, while collect- 
ing significantly more performance points from each profil- 
ing session and correctly characterizing both thread-induced 
and external input. 

Paper organization. The remainder of this paper is organized
as follows. In Section 2 we introduce the threaded read
memory size metric, showing its usefulness through synthetic
examples. Case studies drawn from real applications
are discussed in Section 3. Section 4 proposes an efficient
algorithm for computing the threaded read memory size of
a routine activation. Section 5 describes the most relevant
implementation aspects and Section 6 presents the results of
our experimental evaluation. Related work is discussed in
Section 7 and concluding remarks are given in Section 8.

Figure 1. Threaded read memory size examples.

2. Multithreaded Input Size Estimation 

A crucial issue in input-sensitive profiling is the ability to 
automatically infer the size of the input data on which each 
routine activation operates. This can be done in a
single-threaded scenario using the read memory size metric
introduced in [5]:

Definition 1 ([5]). The read memory size (RMS) of the
execution of a routine r is the number of distinct memory cells
first accessed by r, or by a descendant of r in the call tree, 
with a read operation. 

The intuition behind this metric is the following. Consider
the first time a memory location ℓ is accessed by a routine
activation r: if this first access is a read operation, then ℓ
contains an input value for r. Conversely, if ℓ is first written
by r, then later read operations will not increase the RMS,
since the value stored in ℓ was produced by r itself.

The RMS, although very effective in single-threaded
executions, may fail to properly characterize the input size of
routine activations in a multi-threaded environment. Consider,
as an example, the execution described in Figure 1a:
routine f in thread $T_1$ reads location x twice, but only the
first read operation is a first access. Hence, $RMS_f = 1$.
Notice however that routine g in thread $T_2$ overwrites the value
stored in x before the second read by f: this read operation
gets a value that is not produced by routine f itself and that
should therefore be regarded as new input to f. The same
drawback arises when one or more memory locations are
repeatedly loaded by a routine with values read from an
external source, e.g., network or secondary storage. To
overcome these issues, we propose a novel metric for estimating
the input size, which we call threaded read memory size.

Definition 2. Let r be a routine activation by thread t and
let ℓ be a memory location. An operation on ℓ is:

• a first-access, if ℓ has never been accessed before by r or
by any of its descendants in the call tree;

• an induced first-access, if the latest write(ℓ) performed
by any thread $t' \neq t$, if any, has not been followed by an
access to ℓ by routine r or by any of its descendants in
the call tree.

procedure producer()
    while (1) do
        wait(empty)
        wait(mutex)
        produceData(x)
        signal(mutex)
        signal(full)

procedure consumer()
    while (1) do
        wait(full)
        wait(mutex)
        consumeData(x)
        signal(mutex)
        signal(empty)

Figure 2. Producer-consumer pattern: when n values have
been produced, $RMS_{consumer} = 1$ while $TRMS_{consumer} = n$.

Definition 3. Let r be a routine activation by thread t. The
threaded read memory size $TRMS_{r,t}$ of r with respect to t is
the number of read operations performed by r that are
first-accesses or induced first-accesses.

We notice that the RMS coincides with the number of read
operations that are first-accesses and therefore

$$TRMS_{r,t} \ge RMS_r \qquad (1)$$

for each routine activation r and thread t.



Example 1. Consider again the example in Figure 1a: we
have $TRMS_{f,T_1} = 2$. The first read operation on x is indeed a
first-access, while the second one is an induced first-access:
between the latest write operation on x performed by thread
$T_2 \neq T_1$ and the second read(x) by routine f there are no
other accesses to x by f.
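
Definitions 1 to 3 can be checked mechanically on small traces. The following Python sketch is only an illustration under simplifying assumptions that are not part of the paper: each thread runs a single routine activation that spans the whole trace, and the event encoding `(thread, op, location)` as well as the function name `rms_trms` are hypothetical. Following the convention adopted later in the text, a read that qualifies both as a first-access and as an induced first-access is labelled as induced.

```python
def rms_trms(trace, thread):
    """Return (RMS, TRMS, labels) for the single activation run by `thread`.

    `trace` is a list of (thread_id, op, location) tuples in execution order,
    with op in {"read", "write"}. Hypothetical encoding, for illustration only.
    """
    labels = []
    rms = 0
    for k, (tid, op, loc) in enumerate(trace):
        if tid != thread or op != "read":
            continue
        # Earlier accesses to `loc` made by the routine itself.
        own = [j for j in range(k)
               if trace[j][0] == thread and trace[j][2] == loc]
        # Earlier writes to `loc` made by any other thread.
        foreign = [j for j in range(k)
                   if trace[j][0] != thread
                   and trace[j][1] == "write" and trace[j][2] == loc]
        if not own:
            rms += 1  # the first access to `loc` is a read (Definition 1)
        if foreign and (not own or own[-1] < foreign[-1]):
            labels.append("induced first-access")
        elif not own:
            labels.append("first-access")
        else:
            labels.append("repeated access")
    trms = sum(label != "repeated access" for label in labels)
    return rms, trms, labels


# Figure 1a: f (thread T1) reads x, g (thread T2) overwrites x, f reads x again.
FIG_1A = [("T1", "read", "x"), ("T2", "write", "x"), ("T1", "read", "x")]
print(rms_trms(FIG_1A, "T1"))
```

On the trace of Figure 1a the function reports RMS 1 and TRMS 2 for f, in agreement with Example 1.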

Example 2. Consider Figure 1b. In this case $RMS_h = 1$
and $RMS_f = 1$: function f performs three read operations on
x (one of which through its subroutine h), but only the first
one is a first-access and contributes to its RMS. With respect
to $TRMS_{f,T_1}$, the read operation by h is an induced
first-access for f (similarly to the previous example), while the
third read is not: between the latest write operation on x
performed by thread $T_2 \neq T_1$ and the third read(x), f has
already accessed x through its descendant h.

We also have $TRMS_{h,T_1} = 1$. Notice that the read
operation in h could be regarded both as a first-access and as
an induced first-access with respect to h: since we are
interested in characterizing communication between threads via
shared memory, we will classify accesses of this kind as
induced first-accesses.

Example 3. Producer-consumer is a classical pattern in
concurrent applications. The standard implementation based
on semaphores (see, e.g., [27]) is shown in Figure 2, where
producer and consumer run as different threads and routines
produceData and consumeData write to and read from
memory location x, respectively (the implementation can
be easily extended to buffered read and write operations).
For simplicity of exposition, we will not consider memory
accesses due to semaphore operations. With this assumption,
$RMS_{consumer} = 1$, since the consumer repeatedly reads
the same memory location x. Conversely, the threaded read
memory size gives a correct estimate of the consumer's input
size: whenever producer has generated n values written
to location x at different times, we have $TRMS_{consumer} = n$.
Indeed, all read operations on x are induced first-accesses:
thanks to the interleaving guaranteed by semaphores, each
read(x) in consumeData is always preceded by a write(x)
in produceData.

procedure externalRead()
1: let b be a buffer of size 2
2: for i = 1 to n do
3:     load b with external data    // does not imply read of b
4:     consumeData(b[0])            // read and process b[0]

Figure 3. Buffered read from an external device: after n
iterations, $RMS_{externalRead} = 1$ and $TRMS_{externalRead} = n$.

Example 4. The example in Figure 3 describes the case of
buffered read operations. Procedure externalRead loads
2n values from an external device (line 3): this is done by
the operating system, which fills buffer b with new data at
each iteration. These load operations, however, should not
be implicitly regarded as read operations performed by the
running thread: as shown in line 4, only one of the two values
loaded at each iteration is actually read and processed
by procedure externalRead. Hence, at the end of the execution
$TRMS_{externalRead} = n$, due to the n induced first-accesses
at line 4. Notice that $RMS_{externalRead} = 1$, since data
items are loaded across iterations on the same two memory
locations b[0] and b[1] but only b[0] is repeatedly read. We
will further discuss the interaction between kernel system
calls and threads in Section 4.3.

3. Case Studies 

In this section we discuss the utility of the TRMS metric in 
real applications. We show a variety of cases where TRMS 
correctly characterizes the input size where RMS either fails 
or does not collect sufficient profiling data. Our examples 
are based on the aprof-trms tool described in Section 5
and use basic block (BB) counts as performance metric. 

Input sensitive profiles can be naturally used to produce 
performance charts where some cost measure is plotted 
against the TRMS or the RMS. For instance, for each dis- 
tinct input size n of a routine r, we can plot the maximum 
time spent by an activation of r on input size n (worst-case 
running time plot) or the number of times r was activated on 
an input of size n (workload plot). Similar charts could be 
produced for different cost measures (e.g., average running 
time), though we will not use them throughout this section. 

Our discussion is based on two different applications:
MySQL, a relational database management system, and
vips, an image processing software package included in
the PARSEC 2.1 benchmark suite [3]. MySQL manages every
new connection to the database by means of a separate
thread, which contends for access to different shared data
structures, and uses both I/O and network intensively. We
also remark that vips is a data-parallel application, which
constructs multi-threaded image processing pipelines in order
to apply fundamental image operations such as affine
transformations and convolutions.

Figure 4. Function mysql_select of MySQL: worst-case
running time plots respectively obtained using RMS or TRMS
as an estimate for the input size.

Figure 5. Function im_generate of vips (PARSEC 2.1):
worst-case running time plots respectively obtained using
RMS or TRMS as an estimate for the input size.

Impact of input size estimation on asymptotic trends. If
the input size is not estimated correctly, the analysis of
profiling data can lead to misleading results. Consider for
instance the following scenario: we have n activations
$r_1, \ldots, r_n$ of a routine r; activation $r_i$ has cost i and
performs i read operations, out of which $\lceil i/2 \rceil$ are
first-accesses and $\lfloor i/2 \rfloor$ are induced first-accesses.
Hence, $TRMS_{r_i} = \lceil i/2 \rceil + \lfloor i/2 \rfloor = i$
while $RMS_{r_i} = \lceil i/2 \rceil$. Notice that
$TRMS_{r_i} \ge RMS_{r_i}$, in accordance with Inequality 1. In the
worst-case running time plot obtained using the TRMS we have
n distinct points and the running time grows as the function
$f(x) = x$. Conversely, in the worst-case running time plot
obtained using the RMS we have only n/2 points: any two
consecutive activations $r_i$ and $r_{i+1}$ (with i odd) have the
same RMS value $\lceil i/2 \rceil$ and the worst-case cost is
$i + 1$ (i.e., the maximum between costs i and i + 1 of the two
activations). Hence, the running time appears to grow as the
function $f(x) = 2x$. The problem would be even more critical
if, e.g., $RMS_{r_i} = \lfloor \log i \rfloor$: in this case, in the
worst-case plot obtained using the RMS, the running time would
appear to grow exponentially as $f(x) = 2^x$.
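
The arithmetic of this scenario can be replayed in a few lines. The sketch below is an illustration, not part of the profiler: the helper names `worst_case_plot` and `scenario` are hypothetical, the logarithm is assumed to be in base 2, and the values of n (8 and 15) are chosen only so that every group of activations is complete.

```python
def worst_case_plot(points):
    """Map each estimated input size to the maximum cost observed at that size."""
    plot = {}
    for size, cost in points:
        plot[size] = max(plot.get(size, 0), cost)
    return plot


def scenario(n, size_of):
    """Activation r_i has cost i; size_of(i) is its estimated input size."""
    return worst_case_plot((size_of(i), i) for i in range(1, n + 1))


trms_plot = scenario(8, lambda i: i)                    # TRMS = i
rms_plot = scenario(8, lambda i: (i + 1) // 2)          # RMS = ceil(i / 2)
log_plot = scenario(15, lambda i: i.bit_length() - 1)   # RMS = floor(log2 i)

print(trms_plot)  # n points on the line f(x) = x
print(rms_plot)   # n/2 points on the line f(x) = 2x
print(log_plot)   # about log n points, cost 2^(x+1) - 1, i.e. exponential in x
```

With the TRMS every activation yields its own point, while the two RMS variants collapse several activations onto one input size and keep only the largest cost, which is what distorts the apparent trend.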

Figure 6. Function buf_flush_buffered_writes of
MySQL: worst-case running time plots with curve fitting.

As shown by Figures 4, 5, and 6, similar phenomena can
arise in practice in I/O-bound or multithreaded applications.
The running time of routine mysql_select in Figure 4
appears to grow (at least) quadratically when we measure
the input size by means of the RMS, and linearly using
the TRMS. In this experiment, the query operation simply
selects all tuples in the table and is applied to tables of
increasing sizes: at each query, tuples are partitioned into groups,
each group is loaded into a buffer through a kernel system
call and is then read by routine mysql_select. The RMS
does not count repeated buffer read operations: hence, the
input size on larger tables is exactly the same as in smaller
ones (it roughly coincides with the buffer size), while the
running time grows due to the larger number of buffer loads.

Routine im_generate in benchmark vips shows an
analogous effect (see Figure 5). In this case the induced
first-accesses not counted in the RMS are due to the interaction
between threads via shared memory. In both examples, the
RMS plot appears to reveal an asymptotic bottleneck that
does not actually exist. In other cases, the scenario
might be the opposite: the RMS may not reveal the existence
of a possible performance bottleneck, which can instead be
characterized using the TRMS. For instance, the TRMS plot
of routine buf_flush_buffered_writes of MySQL in Figure 6
shows a superlinear running time, while the RMS plot
only suggests a linear growth, as highlighted by standard
curve fitting techniques.

Profile richness. The usefulness of input-sensitive profile 
data crucially depends on the number of distinct input size 
values collected for each routine: each value corresponds to 
a point in the cost plots associated to a routine, and plots with 
a small number of points do not clearly expose the behavior 
of the routine. In our experiments, we observed that the use 
of TRMS instead of RMS can often yield a larger number of 
distinct input size values and thus more informative plots. 

An example is provided by Figure 7: while routine
wbuffer_write_thread was called 110 times during the
execution of application vips, according to the RMS metric
all its input sizes collapsed onto two distinct values (67 and
69, as shown in Figure 7a). However, this routine performs
many load operations from secondary memory: hence, if we
also take into account external input (Figure 7b), or external
input combined with thread input (Figure 7c), the number
of distinct TRMS values grows considerably and the trend in
the cost plots becomes more meaningful.

Figure 7. Function wbuffer_write_thread of vips (PARSEC 2.1):
(a) RMS cost plot; (b) TRMS cost plot with external input
only; (c) TRMS cost plot with both external and thread input.

Figure 8. Function Protocol::send_eof of MySQL: workload
plots respectively obtained using RMS or TRMS as an
estimate for the input size.

Workload and input characterization. An additional ben- 
efit of input-sensitive profiling is the capability of char- 
acterizing the typical workloads on which a routine is 
called in the context of deployed systems. Richer profile 
data collected using the TRMS metric yield more accurate 
workload characterization, as shown, e.g., by the workload
plots of Figure 8. Moreover, our multithreaded
input-sensitive profiling methodology can also provide insights
on the amount of interaction of each routine with external
devices (external input) and cooperating threads
(thread-induced input): for instance, 99.9% of the input of routine
wbuffer_write_thread (Figure 7) is due to loads from
secondary memory and thread interaction.

For each routine, we can automatically distinguish between
external and thread-induced input. If we sort all routines in
decreasing order of their percentage of induced first-accesses,
we obtain an interesting characterization of the interplay
between workload, computation, and concurrency, as shown in
Figure 9. This figure plots, for each routine of benchmarks
MySQL and vips, the percentage of induced first-accesses
partitioned between external input and thread-induced input:
a first look reveals that induced first-accesses of the majority
of MySQL routines are due to external input, unlike vips
routines, where thread input is predominant. We remark that
charts of this kind can be automatically produced by our
profiler. In Section 6 we will provide a quantitative evaluation
of profile richness and input characterization in a variety of
applications on typical workloads.



4. Computing the Multithreaded Input Size 

In this section we describe an efficient algorithm for com- 
puting the threaded read memory size of a routine activation 
and the input-sensitive profile of a routine. Routine profiles 
are thread-sensitive, i.e., profiles generated by routine acti- 
vations made by different threads are kept distinct (if neces- 
sary, they can be combined in a subsequent step). 

The profiler is given as input multiple traces of program 
operations associated with timing information. Each trace is 
generated by a different thread and includes: routine acti- 
vations (call), routine completions (return), read/write 
memory accesses, and read/write operations performed 
through kernel system calls (kernelRead and kernelWrite, 
necessary to characterize external input). 

As a first step, thread-specific traces are logically merged, 
interleaving operations performed by different threads ac- 
cording to their timestamps, in order to produce a unique ex- 
ecution trace. If two or more operations issued by different 
threads have the same timestamp, ties are broken arbitrarily: 
no assumption can therefore be made about which operation
will be processed first. We remark that after merge and tie 
breaking, trace events are totally ordered. For simplicity of 
exposition, we also assume that switchThread events are in- 
serted in the merged trace between any two operations per- 
formed by different threads. 
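
The merge step can be sketched as follows. This is a minimal Python illustration, not the profiler's actual front end: the trace encoding (per-thread lists of `(timestamp, op, arg)` tuples, already sorted by timestamp) and the function name `merge_traces` are assumptions. Ties between threads are broken by the order in which the threads are listed, which is one arbitrary choice among many.

```python
import heapq


def merge_traces(traces):
    """Merge per-thread traces into one totally ordered trace.

    `traces` maps a thread id to a list of (timestamp, op, arg) tuples sorted
    by timestamp. The result is a list of (op, arg, thread) events with a
    ("switchThread", None, None) event between any two consecutive operations
    issued by different threads.
    """
    tagged = [[(ts, tid, op, arg) for ts, op, arg in events]
              for tid, events in traces.items()]
    merged = []
    last_thread = None
    # heapq.merge compares only the timestamp; equal timestamps are resolved
    # by the position of the thread in `traces` (an arbitrary tie-break).
    for ts, tid, op, arg in heapq.merge(*tagged, key=lambda e: e[0]):
        if last_thread is not None and tid != last_thread:
            merged.append(("switchThread", None, None))
        merged.append((op, arg, tid))
        last_thread = tid
    return merged


example = {
    "T1": [(1, "call", "f"), (2, "read", "x"), (5, "read", "x")],
    "T2": [(3, "call", "g"), (4, "write", "x")],
}
for event in merge_traces(example):
    print(event)
```

On this input the merged trace contains two switchThread events, one before g is called in T2 and one before the second read of x in T1.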

For each operation issued by a routine r in a thread t, the 
profiler must update TRMS and cost information of r with 
respect to t. Some operations might also require updating
profiling data structures related to threads other than t. To
clarify the relationships between different threads, we first
discuss a naive approach as a warm-up for the reader.

4.1 Naive Approach 

Let t be a thread and let r be a routine activated by t. With
a slight abuse of notation, we will denote with $TRMS_{r,t}$
the threaded read memory size of a specific activation of r in
t. According to the definition of multithreaded input size
(see Section 2), computing $TRMS_{r,t}$ requires counting read
operations issued by routine r that are either first accesses
or induced first-accesses. In turn, identifying induced
first-accesses requires monitoring write operations performed by
all threads, i.e., performed also by threads different from t.

A simple-minded approach, which is sketched in Figure 10,
is to maintain a set $L_{r,t}$ of memory locations
accessed during the activation of r. Immediately after entering
r, this set is empty and $TRMS_{r,t} = 0$. Memory locations can
be both added to and removed from $L_{r,t}$ during the
execution of r, as follows:

• when r reads or writes a location ℓ, then ℓ is added to $L_{r,t}$
(if not already present);

• when a thread $t' \neq t$ writes a location ℓ, then ℓ is removed
from $L_{r,t}$ (if present): this makes it possible to recognize
induced first-accesses.

Figure 9. Thread-induced vs. external input on benchmarks (a) MySQL and (b) vips.

Figure 10. Computation of $TRMS_{r,t}$ with a naive approach.
The notation $read_t(\ell)$ / $write_t(\ell)$ indicates that location ℓ is
read/written by thread t.

With this approach, at any time during the execution of
r, a read operation on a location ℓ is a first access (possibly
induced by other threads) if and only if $\ell \notin L_{r,t}$.
Hence, $TRMS_{r,t}$ is increased only if this test succeeds.
Notice that read operations performed by threads different from
t change neither set $L_{r,t}$ nor $TRMS_{r,t}$.

We remark that in the description above r can be any routine
in the call stack of thread t (not necessarily the topmost).
Hence, the same checks and updates must be performed for
all pending routine activations in the call stack of t. Due to
stack walking and to the fact that write operations require
updating the sets $L_{r,t}$ of all threads, this simple-minded
approach is extremely time-consuming. It is also very
space-demanding: in the worst case, each distinct memory location could
be stored in all sets $L_{r,t}$ for each thread t and each routine
activation r pending in the call stack of t. In that case, the
space would be proportional to the memory size times the
maximum stack depth times the number of threads.
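
For concreteness, the naive scheme can be written down directly from the two update rules above. The following Python sketch is an illustration only: the class name `NaiveProfiler`, its method names, and the representation of an activation as a `[routine, set, counter]` list are hypothetical, and cost collection is left out. Both the stack walk on every read and the visit of every thread's stack on every write are visible in the code.

```python
class NaiveProfiler:
    """Naive TRMS computation: one explicit set L_{r,t} per pending activation."""

    def __init__(self):
        self.stacks = {}    # thread id -> list of [routine, L set, trms]
        self.profiles = []  # (thread, routine, trms) of completed activations

    def call(self, routine, t):
        self.stacks.setdefault(t, []).append([routine, set(), 0])

    def ret(self, t):
        routine, _, trms = self.stacks[t].pop()
        self.profiles.append((t, routine, trms))

    def read(self, loc, t):
        # Stack walk: the read may be a (possibly induced) first access
        # for every pending activation of thread t.
        for activation in self.stacks.get(t, []):
            if loc not in activation[1]:
                activation[2] += 1
                activation[1].add(loc)

    def write(self, loc, t):
        # A write touches the sets of all pending activations of all threads.
        for tid, stack in self.stacks.items():
            for activation in stack:
                if tid == t:
                    activation[1].add(loc)
                else:
                    activation[1].discard(loc)


# Figure 1a: f (T1) reads x, g (T2) overwrites x, f reads x again.
p = NaiveProfiler()
p.call("f", "T1"); p.read("x", "T1")
p.call("g", "T2"); p.write("x", "T2"); p.ret("T2")
p.read("x", "T1"); p.ret("T1")
print(p.profiles)
```

The run reports a TRMS of 0 for g and 2 for f, matching Example 1.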

4.2 The Read/Write Timestamping Algorithm 

To obtain a more space- and time-efficient algorithm, we
exploit the latest-access approach described in [5]. Namely,
we avoid storing explicitly the threaded read memory size
$TRMS_{r,t}$ and the sets $L_{r,t}$ of accessed memory locations.
Instead, we maintain partial information that can be updated
quickly during the computation and from which the TRMS
can be easily derived upon the termination of a routine.

In more detail, we adapt the latest-access approach [5] as
follows: for each thread t and memory location ℓ, we store ℓ
in only one set $P_{r,t}$ such that r is the latest routine activation
in t that accessed ℓ (either directly or by its completed
subroutines). At any time during the execution of thread t
and for each pending routine activation r, it holds:

$$L_{r,t} = P_{r,t} \cup \bigcup \{ P_{r',t} : r' \text{ descendant of } r \}$$

where r' is any pending routine activation that is above r in
the call stack at that time. Sets $P_{r,t}$ will be stored implicitly
by associating timestamps to routines and memory locations.

Similarly to the naive approach of Figure 10, locations
could be both added to and removed from $P_{r,t}$ to characterize
induced first-accesses. However, this turns out to be
inefficient in a multithreaded scenario: unlike read operations,
which change only thread-specific sets, write accesses
require changing the sets $P_{r,t}$ of each activation r pending in
the call stack of each running thread t. By implicitly updating
only one set $P_{r,t}$ per thread, the latest-access algorithm
avoids stack walking, but the update time for write accesses
is still linear in the number of threads, which can be
prohibitive in practice.

To overcome this problem, we combine the latest-access
approach with global timestamps that are appropriately
updated upon write accesses to memory locations: in this
way, we will recognize induced first-accesses by comparing
thread-specific timestamps with global ones. The entire
algorithm is sketched in Figure 11.

Data structures. The algorithm uses the following global 
data structures: 

• a counter count that maintains the total number of thread 
switches and routine activations across all threads; 

• a shadow memory wts such that, for each memory location
ℓ, wts[ℓ] is the timestamp of the latest write operation
on ℓ performed by any thread. The timestamp of
a memory access is defined as the value of count at the
time at which the access took place.

procedure call(r, t)
1: count++
2: top_t++
3: S_t[top_t].rtn ← r
4: S_t[top_t].ts ← count
5: S_t[top_t].trms ← 0
6: S_t[top_t].cost ← getCost()

procedure return(t)
1: collect(S_t[top_t].rtn, S_t[top_t].trms, getCost() - S_t[top_t].cost)
2: S_t[top_t - 1].trms += S_t[top_t].trms
3: top_t--

procedure switchThread()
1: count++

procedure read(ℓ, t)
1: if ts_t[ℓ] < wts[ℓ] then
2:     S_t[top_t].trms++
3: else
4:     if ts_t[ℓ] < S_t[top_t].ts then
5:         S_t[top_t].trms++
6:         if ts_t[ℓ] ≠ 0 then
7:             let i be the max index s.t. S_t[i].ts ≤ ts_t[ℓ]
8:             S_t[i].trms--
9:         end if
10:    end if
11: end if
12: ts_t[ℓ] ← count

procedure write(ℓ, t)
1: ts_t[ℓ] ← count
2: wts[ℓ] ← count

Figure 11. TRMS algorithm: multi-threaded input.

Similarly to [5], the algorithm also uses the following
thread-specific data structures for each thread t:

• a shadow memory $ts_t$ such that, for each memory location
ℓ, $ts_t[\ell]$ is the timestamp of the latest access (read or
write) to ℓ made by thread t;

• a shadow run-time stack $S_t$, whose top is indexed by
variable $top_t$. For each $i \in [1, top_t]$, the i-th stack entry
$S_t[i]$ stores:

■ The id $S_t[i].rtn$, the timestamp $S_t[i].ts$, and the
cumulative cost $S_t[i].cost$ of the i-th pending routine
activation.

■ The partial threaded read memory size $S_t[i].trms$ of
the activation, defined so that the following invariant
property holds throughout the execution:

$$\forall i,\ 1 \le i \le top_t : \quad TRMS_{i,t} = \sum_{j=i}^{top_t} S_t[j].trms \qquad (2)$$

where $TRMS_{i,t}$ is a shortcut for $TRMS_{S_t[i].rtn,\,t}$. At any
time, $TRMS_{i,t}$ equals the current TRMS value of the i-th
pending activation on the portion of the execution
trace generated by thread t seen so far.

As shown in [5], Invariant 2 implies the following interesting
property: for each pending routine activation, its TRMS value
can be obtained by summing up its partial threaded read
memory size with the TRMS value of its (unique) pending
child, if any. More formally:

$$TRMS_{top_t,t} = S_t[top_t].trms$$
$$TRMS_{i,t} = S_t[i].trms + TRMS_{i+1,t}$$

for each $i \in [1, top_t - 1]$. Hence, if we can correctly maintain
the partial threaded read memory size during the execution,
upon completion of a routine we will also get the correct
TRMS value.

Algorithm and analysis. The partial threaded read memory
size can be maintained as shown in Figure 11. We first
notice that the global timestamp counter count is increased
at each thread switch and routine call, and its value is used to
update routine timestamps (line 4 of procedure call), global
memory timestamps (line 2 of procedure write), and local
memory timestamps (lines 1 and 12 of procedures write
and read, respectively). Upon activation of a routine,
procedure call(r, t) creates and initializes a new shadow stack
entry for routine r in $S_t$. When the routine activation
terminates, its cost is collected and its partial TRMS (which at this
point coincides with the true TRMS value according to equation
$TRMS_{top_t,t} = S_t[top_t].trms$ discussed above) is added
to the partial TRMS of its parent, preserving Invariant 2.

Local timestamps of memory locations are updated both
by read and write accesses, while global timestamps are not
updated upon read operations (they are thus associated with
write operations only). This update scheme makes it possible
to recognize induced first-accesses to any location ℓ,
which is done by lines 1-2 of procedure read. If the
read/write timestamp $ts_t[\ell]$ local to thread t is smaller than the
global write timestamp $wts[\ell]$, then location ℓ must have
been written more recently than the last read/write access
to ℓ by thread t. Note that, if the latest access to ℓ was a
write operation by thread t, then it would be $ts_t[\ell] = wts[\ell]$
(see procedure write), letting the test $ts_t[\ell] < wts[\ell]$ fail.
Hence, if the test succeeds, the last write on ℓ must have been
done by some thread $t' \neq t$, the read access by t is an
induced access, and the partial TRMS of the topmost routine is
correctly increased by line 2 of procedure read. Invariant 2
is fully preserved by this assignment: the accessed value is
new not only for the topmost routine in the call stack $S_t$, but
also for all its ancestors, whose TRMS is implicitly updated
according to Equation 2.

On the other hand, if the test of line 1 of procedure read
fails, the read access to $\ell$ might still be a first access: this
happens if the last access to location $\ell$ by thread $t$ took place
before entering the current (topmost) routine. Lines 4-10
address this case, updating the partial TRMS as described
in [5]: the partial TRMS of the topmost routine is increased,
while the partial TRMS of an appropriately chosen ancestor
is decreased (it is proved in [5] that Invariant 2 is preserved).

The running time of all operations is constant, except for
line 7 of procedure read that requires $O(\log d_t)$ worst-case
time, where $d_t$ is the depth of the call stack $S_t$.
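
The procedures described above can be sketched in Python as follows. This is only an illustration reconstructed from the prose, not the code of Figure 11: the class and method names are hypothetical, cost collection is omitted, a timestamp of 0 is assumed to mean "never accessed", and every memory access is assumed to occur while some routine of the thread is pending.

```python
from collections import defaultdict


class Activation:
    def __init__(self, routine, ts):
        self.routine = routine
        self.ts = ts      # value of count when the routine was called
        self.trms = 0     # partial TRMS of this activation


class TrmsSketch:
    def __init__(self):
        self.count = 0                                   # global counter
        self.wts = defaultdict(int)                      # global write timestamps
        self.ts = defaultdict(lambda: defaultdict(int))  # per-thread timestamps
        self.stack = defaultdict(list)                   # per-thread shadow stacks
        self.profile = []                                # (thread, routine, TRMS)

    def switch_thread(self):
        self.count += 1

    def call(self, routine, t):
        self.count += 1
        self.stack[t].append(Activation(routine, self.count))

    def ret(self, t):
        done = self.stack[t].pop()
        self.profile.append((t, done.routine, done.trms))
        if self.stack[t]:
            # The partial TRMS of the child is added to its parent.
            self.stack[t][-1].trms += done.trms

    def write(self, loc, t):
        self.ts[t][loc] = self.count
        self.wts[loc] = self.count

    def read(self, loc, t):
        stack = self.stack[t]
        top, last = stack[-1], self.ts[t][loc]
        if last < self.wts[loc]:
            top.trms += 1                 # induced first-access
        elif last < top.ts:
            top.trms += 1                 # first access by the topmost routine
            if last != 0:
                i = self._latest_started_by(stack, last)
                if i >= 0:
                    stack[i].trms -= 1    # ancestor that had already seen loc
        self.ts[t][loc] = self.count

    @staticmethod
    def _latest_started_by(stack, last):
        # Binary search: max index i with stack[i].ts <= last, or -1 if none.
        lo, hi = -1, len(stack) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if stack[mid].ts <= last:
                lo = mid
            else:
                hi = mid - 1
        return lo
```

The binary search is the only non-constant step, which matches the $O(\log d_t)$ bound stated above.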

4.3 External Input 

In Section 4.2 we have focused on induced first-accesses
generated by multi-threaded executions. In this section we
show that the read/write timestamping algorithm can be
naturally extended to take into account also induced accesses
due to external input.

procedure kernelWrite($\ell$)          procedure kernelRead($\ell$, $t$)
1: count++                            1: read($\ell$, $t$)
2: $wts[\ell] \leftarrow$ count

Figure 12. TRMS algorithm: external input.

Procedures kernelRead and kernelWrite shown in
Figure 12 update the profiler's data structures when memory
accesses are mediated by kernel system calls. Threads
invoke system calls to get data from external devices (e.g.,
a disk or the network) or to send data to external devices.
We remark that the operating system kernel must be treated
differently from normal threads in our algorithm, since there
is no kernel-specific shadow memory or shadow stack.

When a thread sends data to an external device, it must
delegate to the operating system the task of reading the
memory locations containing those data and writing their
content to the device. Hence, a thread external write operation
corresponds to a kernelRead event in the execution trace. As
shown in Figure 12, read memory accesses by the operating
system are regarded as read operations implicitly performed
by the thread, as if the system call were a normal subroutine.
Upon a kernelRead event, it is therefore sufficient to invoke
procedure read that, if necessary, will update the TRMS
of pending routine activations.

The case of kernelWrite operations is slightly more
subtle. When a thread needs data from an external device,
it delegates to the operating system the task of writing the
device data to some memory buffer (if the buffer consists of
n memory locations, the execution trace will contain n distinct
kernelWrite events). This buffer load, however, should not
be regarded as a thread external read operation: indeed, it
may happen that only a subset of the loaded memory locations
will be actually processed (and thus read) by the thread,
and only that subset should be counted as external input.
For this reason procedure kernelWrite does not directly
change the partial TRMS of the topmost routine. Instead,
it first increases count and then associates buffer memory
locations with a global write timestamp that is larger than
any thread-specific timestamp. This forces the test
$ts_t[\ell] < wts[\ell]$ to succeed if a buffer location $\ell$ is
subsequently read by the thread, properly increasing the partial
TRMS only for actual read operations.
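
The handling of external input can be illustrated with a small self-contained Python sketch. It is a simplification under stated assumptions: the class `ExternalInputTracker` and the per-thread counter `induced` (a stand-in for the partial TRMS of the topmost routine) are hypothetical, and the shadow stack and the non-induced branch of procedure read are left out.

```python
from collections import defaultdict


class ExternalInputTracker:
    def __init__(self):
        self.count = 0                                   # global counter
        self.wts = defaultdict(int)                      # global write timestamps
        self.ts = defaultdict(lambda: defaultdict(int))  # per-thread timestamps
        self.induced = defaultdict(int)                  # stand-in for S_t[top_t].trms

    def read(self, loc, t):
        # Only the induced first-access test of procedure read is modelled.
        if self.ts[t][loc] < self.wts[loc]:
            self.induced[t] += 1
        self.ts[t][loc] = self.count

    def kernel_write(self, loc):
        # The kernel fills a buffer location: bump the counter so that the
        # new write timestamp exceeds every thread-specific timestamp.
        self.count += 1
        self.wts[loc] = self.count

    def kernel_read(self, loc, t):
        # The kernel reads on behalf of thread t, as if in a subroutine.
        self.read(loc, t)
```

If the kernel loads four buffer locations and the thread reads only two of them, only those two reads are counted; a later refill of a location makes its next read count again.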

4.4 Counter Overflows 

The global counter used by the read/write timestamping
algorithm is common to all running threads and in our initial
experiments was affected by frequent overflows, especially
for long-running applications. Unfortunately, overflows are
a serious concern in the computation of the TRMS, since they
alter the partial ordering between memory timestamps,
yielding wrong input size values. To overcome this issue, we
perform a periodic global renumbering of timestamps in the
profiler's data structures, taking care not to alter the partial
order between $ts_t[\ell]$, $wts[\ell]$, and $S_t[i].ts$ for each memory
location $\ell$, running thread $t$, and $1 \le i \le top_t$. In doing so,
we exploit the following observation: in Figure 11 there is
no comparison between $wts[\ell]$ and $wts[\ell']$ or between $ts_t[\ell]$
and $ts_t[\ell']$, for $\ell \ne \ell'$. Hence, the order between timestamps
of different memory locations is irrelevant and can be
arbitrarily changed.

procedure counterOverflow()
 1: for each running thread $t$ do
 2:   for $i = 1$ to $top_t$ do
 3:     add $S_t[i].ts$ to set $A$ of active timestamps
 4: sort($A$)
 5: for each running thread $t$ do
 6:   for $i = 1$ to $top_t$ do
 7:     $p \leftarrow$ position of timestamp $S_t[i].ts$ in $A$
 8:     $S_t[i].ts \leftarrow 3 \cdot p$
 9: for each memory location $\ell$ do
10:   $q \leftarrow$ max idx in $A$ s.t. $wts[\ell] \ge A[q]$
11:   for each running thread $t$ do
12:     if $ts_t[\ell] < A[q] \lor ts_t[\ell] \ge A[q+1]$ then
13:       $j \leftarrow$ max idx in $A$ s.t. $ts_t[\ell] \ge A[j]$
14:       $ts_t[\ell] \leftarrow 3 \cdot j$
15:     else if $wts[\ell] = ts_t[\ell]$ then $ts_t[\ell] \leftarrow 3 \cdot q + 1$
16:     else if $wts[\ell] > ts_t[\ell]$ then $ts_t[\ell] \leftarrow 3 \cdot q$
17:     else $ts_t[\ell] \leftarrow 3 \cdot q + 2$
18:   $wts[\ell] \leftarrow 3 \cdot q + 1$
19: count $\leftarrow 3 \cdot |A| + 3$

Figure 13. Counter overflow procedure.

Our renumbering algorithm is sketched in Figure 13. For
the sake of efficiency, the algorithm checks and renumbers
each timestamp only once. Lines 1-4 collect all timestamps
in the call stacks of running threads and sort them in
increasing order. Notice that these timestamps are distinct: count
is increased by procedure call (see Figure 11) so that a
new activation is always assigned an unused value, and the
renumbering algorithm keeps the property true.

Routine timestamps are re-assigned in lines 5-8: the new 
timestamps are multiples of 3 (this choice will be justified 
below) and are chosen according to the rank of the original 
routine timestamp in the sorted set A. This guarantees that 
the original ordering between any two routine timestamps 
is preserved, and that the maximum value of a timestamp 
will be proportional to the total number of pending routine 
activations (i.e., $|A|$).

Thread-specific and global timestamps of memory locations
are re-assigned in lines 9-18. According to line 10, let
$A[q]$ be the latest pending routine activation (in any thread)
started before the latest write to memory location $\ell$. A thread
$t$ could have accessed $\ell$ before this activation (i.e.,
$ts_t[\ell] < A[q]$), between pending routine activations $A[q]$ and
$A[q+1]$ (i.e., $A[q] \le ts_t[\ell] < A[q+1]$), or after pending routine
activation $A[q+1]$ (i.e., $ts_t[\ell] \ge A[q+1]$). If $q+1$ is not a
valid index for $A$ and $ts_t[\ell] \ge A[q]$, we can assume to be in
the second case. The first and the third cases can be treated
by assigning $ts_t[\ell]$ the same value used for the most recent
activation $j$ such that $ts_t[\ell] \ge A[j]$ (lines 12-14): this
guarantees that comparisons between $ts_t[\ell]$ and any routine
timestamp at lines 4 and 7 of procedure read will succeed
if and only if they succeeded before renumbering. The second
case ($A[q] \le ts_t[\ell] < A[q+1]$) requires distinguishing
between three different situations, which explains why new
routine timestamps are chosen in line 8 as multiples of 3:

1. $wts[\ell] = ts_t[\ell]$: $t$ was the last thread to write location $\ell$.
After renumbering (lines 15 and 18), $ts_t[\ell] = wts[\ell] = 3q + 1$.
This guarantees that $A[q] = 3q < ts_t[\ell] = wts[\ell] = 3q + 1 < A[q+1] = 3(q+1)$;

2. $wts[\ell] > ts_t[\ell]$: thread $t$ has accessed location $\ell$ before
its last write. In this case the new value of $ts_t[\ell]$ is $3q$
(line 16). This preserves both the relations $ts_t[\ell] = 3q < wts[\ell] = 3q + 1$
and $ts_t[\ell] = 3q < A[q+1] = 3(q+1)$;

3. $ts_t[\ell] > wts[\ell]$: thread $t$ has read location $\ell$ after its last
write. In this case $ts_t[\ell]$ gets the new value $3q + 2$ (line
17). The order relation $wts[\ell] = 3q + 1 < ts_t[\ell] = 3q + 2 < A[q+1] = 3(q+1)$
remains valid.

Notice that the global timestamp count is assigned with a 
value larger than all the other timestamps (line 19). 

Using binary search to implement lines 7, 10, and 13,
the running time of the global renumbering algorithm is
$O(p \log p + \mu r \log p)$, where $r$, $\mu$, and $p$ are the numbers
of running threads, distinct memory cells, and pending routine
activations, respectively. This can be amortized against
$\Omega(2^w)$ thread switch and routine call operations, where $w$ is
the word memory size.
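
The renumbering step can be sketched in Python as follows. This is an illustrative reconstruction from the description above, not the tool's code: the function name and the dictionary-based data layout are assumptions, comparisons against routine timestamps are taken to be non-strict where the prose speaks of the "latest activation started before" an access, and a location absent from `wts` is treated as never written (timestamp 0).

```python
from bisect import bisect_right


def renumber(stacks, ts, wts):
    """Renumber all timestamps in place and return the new value of count.

    stacks: {thread: [routine timestamps, bottom to top]}
    ts:     {thread: {location: thread-specific timestamp}}
    wts:    {location: global write timestamp}
    """
    # Lines 1-4: collect and sort the (distinct) active routine timestamps.
    A = sorted(x for stack in stacks.values() for x in stack)

    def rank(x):
        # Max 1-based index q such that A[q] <= x, or 0 if there is none.
        return bisect_right(A, x)

    # Lines 5-8: the routine at rank p gets the new timestamp 3 * p.
    for stack in stacks.values():
        stack[:] = [3 * rank(old) for old in stack]

    # Lines 9-18: renumber the timestamps of each memory location.
    locations = set(wts).union(*(per_thread.keys() for per_thread in ts.values()))
    for loc in locations:
        w = wts.get(loc, 0)
        q = rank(w)
        lower = A[q - 1] if q >= 1 else float("-inf")   # old A[q]
        upper = A[q] if q < len(A) else float("inf")    # old A[q + 1]
        for per_thread in ts.values():
            if loc not in per_thread:
                continue
            old = per_thread[loc]
            if old < lower or old >= upper:
                per_thread[loc] = 3 * rank(old)         # first and third cases
            elif old == w:
                per_thread[loc] = 3 * q + 1             # t was the last writer
            elif old < w:
                per_thread[loc] = 3 * q                 # accessed before last write
            else:
                per_thread[loc] = 3 * q + 2             # read after last write
        if loc in wts:
            wts[loc] = 3 * q + 1

    # Line 19: the counter restarts above every renumbered timestamp.
    return 3 * len(A) + 3
```

Each lookup uses `bisect_right` from the Python standard library, mirroring the binary searches that give the $\log p$ factor in the bound above.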

5. Implementation 

To prove the feasibility of our approach, we implemented 
a threaded input-sensitive profiler by developing a Val- 
grind [23] tool called aprof-trms. Valgrind provides a 
dynamic instrumentation infrastructure that translates the 
binary code into an architecture-neutral intermediate repre- 
sentation (VEX). Analysis tools provide callbacks for events 
generated by the stream of VEX executed instructions. 

Instrumentation. Similarly to the input-sensitive profiler
described in [5], our tool traces all memory accesses and
function calls and returns. We count basic blocks as a
performance measure: tracing function calls and returns requires
instrumenting each basic block, thus counting basic blocks
adds a light burden to the analysis time overhead, and
improves accuracy in characterizing asymptotic behavior even
on small workloads. Measuring basic blocks rather than time
has several other advantages, very well motivated in [18]. In
order to take into account external input, system calls are
wrapped and properly mapped to one or more kernelRead
or kernelWrite events: among the main system calls
on a Linux x86_64 machine, write, sendto, pwrite64,
writev, msgsnd, and pwritev correspond to kernelRead
events, while read, recvfrom, pread64, readv, msgrcv,
and preadv correspond to kernelWrite events.
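
The mapping from system calls to trace events can be written down as a small lookup table. The Python sketch below lists only the calls named above; the function names and the choice of one event per byte of the transferred buffer are illustrative assumptions, not a description of how aprof-trms wraps system calls.

```python
# System calls through which the kernel READS thread memory (thread output).
KERNEL_READ_SYSCALLS = frozenset(
    {"write", "sendto", "pwrite64", "writev", "msgsnd", "pwritev"}
)
# System calls through which the kernel WRITES thread memory (thread input).
KERNEL_WRITE_SYSCALLS = frozenset(
    {"read", "recvfrom", "pread64", "readv", "msgrcv", "preadv"}
)


def kernel_event(syscall):
    """Return the event kind for a system call name, or None if untracked."""
    if syscall in KERNEL_READ_SYSCALLS:
        return "kernelRead"
    if syscall in KERNEL_WRITE_SYSCALLS:
        return "kernelWrite"
    return None


def events_for(syscall, addr, nbytes):
    """Expand one call into one event per location of the buffer.

    addr and nbytes describe the bytes actually transferred (assumed
    byte granularity, chosen here only for illustration).
    """
    kind = kernel_event(syscall)
    if kind is None:
        return []
    return [(kind, addr + i) for i in range(nbytes)]
```

Note the inversion: a `write` system call produces kernelRead events, because it is the kernel that reads the thread's buffer.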

Thread interleaving. Under Valgrind, threads of a traced 
application are serialized. This makes the development and 
debugging of a dynamic analysis framework and of its de- 
rived tools easier. Serialization should not be seen as a 
crucial limitation of our implementation: for instance Hel- 
grind flUzi] and DRD S, two popular tools for detect- 
ing synchronization errors in programs that use the POSIX 
pthreads primitives, are both based on Valgrind. Serialization
implies that our profiler does not need to perform
tie breaking of events (see Section 4). However, in a serialized
scenario the scheduling of threads becomes a critical
concern: thread interleaving may be altered and the execu- 
tion may deviate from non-serialized executions. In order to 
avoid unrealistic executions, our tool takes advantage of the fair
thread scheduler introduced in the latest release of Valgrind. 

Shadow memories. To reduce space overhead in practice,
we maintain global and thread-specific shadow memories
using three-level lookup tables. A similar approach is also
adopted by other prominent tools, such as memcheck [25].
A primary table indexes 2048 secondary tables, each covering
1GB of address space by indexing 16K chunks. Each
chunk, in turn, contains the set of 32-bit timestamps for
64KB of address space. In this way only chunks related to
memory cells actually accessed by a thread need to be
shadowed in its thread-specific shadow memory. Hence, on
average (e.g., with embarrassingly parallel applications), the
accessed primary memory is roughly partitioned among all
running threads: the overhead for maintaining global and
thread-specific shadow memories is therefore considerably
smaller than in the worst-case scenario (where it would be
proportional to number of running threads × number of
distinct memory cells). Experiments in Section 6 will largely
confirm this hypothesis.
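
The three-level layout follows from the sizes given above: 2048 primary entries need 11 address bits, 16K chunks per secondary table need 14 bits, and a 64KB chunk needs 16 bits, for a 41-bit address range. The Python sketch below shows lazy allocation along these lines; the class name, the use of Python lists, and the choice of one timestamp per byte address are assumptions made for illustration only.

```python
PRIMARY_BITS = 11     # 2048 secondary tables
SECONDARY_BITS = 14   # 16K chunks per secondary table (1GB each table)
OFFSET_BITS = 16      # 64KB of address space per chunk


class ShadowMemory:
    def __init__(self):
        # Secondary tables and chunks are allocated only when first written.
        self.primary = [None] * (1 << PRIMARY_BITS)

    @staticmethod
    def split(addr):
        """Split an address into (primary, secondary, offset) indexes."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        secondary = (addr >> OFFSET_BITS) & ((1 << SECONDARY_BITS) - 1)
        primary = addr >> (OFFSET_BITS + SECONDARY_BITS)
        return primary, secondary, offset

    def get(self, addr):
        pri, sec, off = self.split(addr)
        table = self.primary[pri]
        if table is None or table[sec] is None:
            return 0                      # never accessed
        return table[sec][off]

    def set(self, addr, timestamp):
        pri, sec, off = self.split(addr)
        if self.primary[pri] is None:
            self.primary[pri] = [None] * (1 << SECONDARY_BITS)
        table = self.primary[pri]
        if table[sec] is None:
            table[sec] = [0] * (1 << OFFSET_BITS)
        table[sec][off] = timestamp & 0xFFFFFFFF   # keep 32 bits
```

Only the chunks a thread actually touches are materialized, which is why thread-specific shadow memories stay small when threads work on mostly disjoint data.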



6. Experimental Evaluation 

In this section we discuss the results of an extensive
experimental evaluation of aprof-trms on a variety of
benchmarks, including the SPEC OMP2012 [20] and the
Princeton Application Repository for Shared-Memory
Computers (PARSEC 2.1) [3]. The goals of our experiments are
threefold: studying the slowdown and space overhead of
aprof-trms compared to other heavyweight dynamic analysis
tools, evaluating the benefits of the TRMS with respect
to the RMS, and characterizing the amount of thread-induced
and external input on the considered benchmarks.
Table 1. Performance comparison of aprof-trms with aprof and some prominent Valgrind tools on the SPEC OMP2012
benchmarks.



6.1 Experimental Setup 

Benchmarks. The OMP2012 benchmark suite of the
Standard Performance Evaluation Corporation [20] is a
collection of fourteen OpenMP-based applications from different
science domains. All of them were run on the SPEC input
train workloads in 64-bit mode. We could successfully test
all the components except for bt331 and swim, whose
execution failed due to a Valgrind memory issue.

The Princeton Application Repository for Shared-Memory 
Computers (PARSEC 2.1) is a benchmark suite for studies 
of Chip-Multiprocessors [3]. It includes different workloads
chosen from a variety of areas such as computer vision, me- 
dia processing, computational finance, enterprise servers, 
and animation physics. PARSEC defines six input sets for 
each benchmark: experimental results reported in this sec- 
tion are all based on the simlarge sets [3]. 

For the sake of completeness, we also included in our
tests the MySQL application (version 5.5.30) discussed in
Section 3: we used the mysqlslap load emulation client [22],
simulating 50 concurrent clients that submit approximately
1000 auto-generated queries.

Metrics. Besides slowdown and space overhead, we use 
the following metrics: 

1. Routine profile richness: for each routine $r$, let $|\mathrm{RMS}_r|$ be
the number of distinct RMS values collected for routine $r$
(each value corresponds to a point in the graphs associated
with $r$). Similarly, let $|\mathrm{TRMS}_r|$ be the number of distinct
TRMS values collected for routine $r$ for all threads.
The profile richness of routine $r$ is defined as:

$$\frac{|\mathrm{TRMS}_r| - |\mathrm{RMS}_r|}{|\mathrm{RMS}_r|}$$

Intuitively, this metric compares the number of distinct
input values obtained using the TRMS and the RMS,
respectively. We notice that $|\mathrm{TRMS}_r| \ge |\mathrm{RMS}_r|$ does not
necessarily hold: it may happen that two distinct RMS
values $x$ and $y$ (obtained from two different activations
of a routine) correspond to the same TRMS value $z$, with
$z \ge \max\{x, y\}$. Hence, the profile richness may be either
positive, if more points are collected using the TRMS, or
negative, if more points are collected using the RMS. We
will see that in practice the latter case rarely occurs.

2. Input volume: according to Inequality 1, the TRMS of a
routine activation is always larger than or equal to the
RMS of the same activation. The input volume metric
characterizes the increase of the input size values due
to multi-threading and to external input for an entire
execution:

$$1 - \frac{\sum_{\text{routine activations } r} \mathrm{RMS}_r}{\sum_{\text{routine activations } r} \mathrm{TRMS}_r}$$

Values of this metric range in $[0, 1)$. If $\mathrm{TRMS}_r = \mathrm{RMS}_r$
for all routine activations $r$, then the input volume is 0.
Conversely, if $\mathrm{TRMS}_r \gg \mathrm{RMS}_r$ for all routine activations
$r$, then the input volume gets close to 1.

3. Thread-induced input: this metric measures the percentage
of induced first-accesses (line 2 of procedure read in
Figure 11) due to multi-threading.

4. External input: similarly to the previous case, this metric 
measures the percentage of induced first-accesses due to 
external input. 
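
The four metrics are simple ratios and can be computed as in the following Python sketch. It assumes that the profile richness is $(|\mathrm{TRMS}_r| - |\mathrm{RMS}_r|) / |\mathrm{RMS}_r|$ and the input volume is $1 - \sum_r \mathrm{RMS}_r / \sum_r \mathrm{TRMS}_r$, as the definitions above read; the function names and input formats are illustrative assumptions.

```python
def profile_richness(trms_values, rms_values):
    """Relative gain in distinct input sizes for one routine.

    Both arguments list the input sizes observed over all activations of
    the routine (rms_values must be non-empty); duplicates are collapsed.
    """
    n_trms = len(set(trms_values))
    n_rms = len(set(rms_values))
    return (n_trms - n_rms) / n_rms


def input_volume(activations):
    """Input volume of an execution, from (rms, trms) pairs per activation."""
    pairs = list(activations)
    total_rms = sum(rms for rms, _ in pairs)
    total_trms = sum(trms for _, trms in pairs)
    return 1 - total_rms / total_trms


def induced_split(thread_induced, external):
    """Percentages of thread-induced and external induced first-accesses."""
    total = thread_induced + external
    return 100 * thread_induced / total, 100 * external / total
```

For example, a routine with RMS values {1, 2} and TRMS values {1, 2, 3, 4} has profile richness 1.0, and an execution in which every activation has RMS 1 and TRMS 4 has input volume 0.75.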

Evaluated tools. We compared the performance of aprof-trms
to four reference Valgrind tools: nulgrind, which
does not collect any useful information and is used only
for testing purposes, memcheck [25], a tool for detecting
memory-related errors, callgrind [30], a call-graph
generating profiler, and helgrind [19], a data race detector.
Although the considered tools solve different analysis
problems, all of them share the same instrumentation
infrastructure provided by Valgrind, which accounts for a significant
fraction of the execution times: memcheck does not trace
function calls/returns and mainly relies on memory read/write
events; callgrind instruments function calls/returns,
but not memory accesses; and helgrind analyzes concurrent
applications. We also compared aprof-trms against
aprof-rms, a previous version of aprof based on the RMS
metric (introduced earlier in the paper): we remark that
aprof-rms targets sequential computations only, without
taking into account induced first-accesses.

Figure 14. (a) Time and (b) space overhead, with respect to
nulgrind, as a function of the number of threads.

Platform. Experiments were performed on a cluster ma- 
chine with four nodes, each equipped with two 64-bit AMD 
Opteron Processors 6272 @ 2.10 GHz (32 cores), with 64 
GB of RAM running Linux kernel 2.6.32 with gcc 4.4.7 and 
Valgrind 3.8.1 - SVN rev. 13126. 

6.2 Experimental Results 

Slowdown and space overhead. Performance figures of 
our evaluated tools on the SPEC OMP2012 benchmarks, 
obtained spawning four OpenMP threads per benchmark, 
are summarized in Table 1. We do not report results for the
PARSEC benchmarks because some tools revealed invalid 
memory accesses and other memory issues that prevented a 
reliable comparison of executions under different tools. 

Compared to native execution, all the evaluated tools 
exhibit a large slowdown: even nulgrind, which is reported
to be roughly 5 times slower than native [28], in
our experiments turned out to have a mean slowdown factor
of 23.6×. aprof-trms is on average 6 times slower
than nulgrind: this is worse than memcheck, which is 1.5 
times faster than our tool but does not trace function calls 
and returns, and better than helgrind, which is 1.3 times 
slower than aprof-trms and is the only tool designed for 
the analysis of concurrent computations. Recognizing in- 
duced first-accesses causes a 38% overhead on the running 
time, as demonstrated by the comparison of aprof-trms 
with aprof-rms. 

The mean memory requirements of aprof-trms are
within a factor of 3.3 of native execution. This confirms
our expectation: if memory is roughly partitioned among the
four threads, the three-level lookup tables guarantee that the
overall size of thread-specific shadow memories is
proportional to the size of accessed memory locations. This should
be added to the size of the global shadow memory, thus
obtaining the total 3.3× space overhead. Even tools that do not
use shadowing, such as nulgrind and callgrind, require
at least 1.4× more space than native execution. memcheck
hinges upon memory shadowing, but turns out to be more
efficient than aprof-trms thanks to the adoption of memory
compression schemes and to its independence from the
number of threads. Similarly, aprof-rms is slightly more
efficient than our tool due to the lack of a global shadow
memory. On the other hand we remark that helgrind, which is
akin to our tool with respect to the analysis of concurrency
issues, uses 36% more space than aprof-trms.

Figure 14 shows the average slowdown and space overhead,
with respect to the reference Valgrind tool nulgrind,
as a function of the number of spawned OpenMP threads. 
All the evaluated tools appear to scale properly. The average 
slowdown slightly decreases as the number of threads in- 
creases: this is because Valgrind serializes the execution of 
threads, and the time spent for instrumentation can be better 
amortized over the serialized executions of a larger number 
of threads. Overall, this experiment confirms the results detailed
in Table 1 in the case of four threads. The mean space
overhead of callgrind and memcheck is roughly constant: 
their analyses are indeed independent from concurrency is- 
sues. On the other hand, aprof-trms and helgrind show 
a modest increase of the space overhead when the number 
of threads increases: our profiling of the memory usage of 
aprof-trms revealed that the space overhead mostly de- 
pends on shadow memories, whose total space usage, how- 
ever, grows sublinearly with the number of threads. This 
confirms the effectiveness of our implementation based on 
three-level lookup tables. The comparison with aprof-rms 
also suggests that the space overhead due to the global 
shadow memory used by aprof-trms is better amortized 
as the number of threads increases. 



TRMS versus RMS. Our second set of experiments aims
at evaluating the benefits of the TRMS metric with respect to
the RMS. As shown in [5], an RMS-based input-sensitive
profiler can collect a significant number of distinct input values
for most algorithmic-intensive functions, thus producing
informative cost plots. A first natural question is whether using
TRMS instead of RMS has any positive or negative impact on
the profile richness. Charts in Figure 15 help answer
this question. Each curve is related to a specific benchmark.
A point $(x, y)$ on a curve means that $x$% of routines have
profile richness at least $y$: e.g., in benchmark dedup, the
number of points collected using the TRMS metric is more
than 100 times larger than using the RMS for roughly 4% of
the routines. As expected, only for a small percentage of
routines $|\mathrm{TRMS}_r|$ is much larger than $|\mathrm{RMS}_r|$: this is due to the
fact that I/O and thread communication are typically
encapsulated in a small number of software components. However,
for these routines $|\mathrm{TRMS}_r|$ can be substantially larger than
$|\mathrm{RMS}_r|$, e.g., up to a factor of roughly 10^ for benchmark
dedup. We also notice that profile richness is negative only
for a statistically negligible number of routines: this means
that TRMS-based profiles are (almost) always at least as
informative as those obtained using the RMS.
Figure 15. Routine profile richness of TRMS w.r.t. RMS for a representative set of benchmarks.
Figure 16. Input volume of TRMS w.r.t. RMS for a representative set of benchmarks.



By Inequality 1, TRMS values are always at least as large as
RMS values for the same routine activations. Figure 16
characterizes the increase of the input size values due to
induced first-accesses on a representative set of benchmarks.
The interpretation of these graphs is similar to Figure 15:
a point $(x, y)$ on each benchmark-specific curve means that
$x$% of routines have input volume $\ge y$. E.g., in benchmark
fluidanimate roughly 3% of the routines take almost all
their input from external devices or from other threads. The
trend of curves in Figure 16 decreases steeply from 100 to
0, reaching its minimum at $x \approx 8$% for most benchmarks:
this means that 8% of the routines are responsible for thread
intercommunication and streamed I/O, and the input size of
these routines cannot be appropriately predicted by the RMS
metric alone.

Analysis of induced first-accesses. In the previous
experiments we observed that a non-negligible number of
routines communicate with other threads or with the kernel
via system calls. A natural question is how many induced
first-accesses are due to external input or are thread-induced.
Figure 17 answers this question, plotting the percentage of
thread-induced and external input on a representative set of
benchmarks: percentages are computed over the total number
of induced first-accesses, and therefore sum up to 100%.
Benchmarks are sorted by decreasing percentage of
thread-induced input (and thus by increasing external input). An
interesting observation is that the SPEC OMP2012
benchmarks get naturally clustered in the leftmost part of the
histogram (from nab to botsalgn), and all of them have
thread-induced input larger than 69%. We notice that external
input is predominant in vips, which seems in contrast
with Figure 9. This contradiction, however, is only apparent
and has a clear explanation. Figure 9 plots external input
on a routine-per-routine basis, while the global percentage
in Figure 17 is routine-independent: the external input of a
specific routine also includes the external input of all its
descendants in the call tree, which is instead neglected in the
global benchmark measure (where each induced first-access
is counted only once in the percentage computation). Similar
considerations apply to mysqlslap.

For the sake of completeness, Figure 18 and Figure 19
provide a quantitative evaluation of thread-induced and ex- 
ternal input on a routine-per-routine basis: a point (x, y) on 
each benchmark-specific curve means that x% of routines 
have external / thread-induced input > y%. For instance, in 
benchmark dedup, 16% of the routines are such that at least 
20% of their induced first-accesses are due to thread
intercommunication. These charts are in the spirit of Figure 9,
but exploit a more compact representation. 

7. Related Work 

There is a vast literature on performance profiling, both 
at the inter- and intra-procedural level: see, e.g., {1^ 2, 41 
I9l4l ll |29[ I33I1 and the references therein. All these works 
aim at associating performance metrics to distinct paths tra- 
versed in the call graph or in the control flow graph during a 
program's execution. Input-sensitivity issues are instead ex- 
plored in HH [11 [m. Marin and Mellor-Crummey JIsl] 
consider the problem of understanding how an application's 
performance scales given different problem sizes, using data 
collected from multiple runs with determinable input parameters.
Goldsmith, Aiken, and Wilkerson [8] also propose
to run a program on workloads of different sizes, to mea- 
sure the performance of its routines, and eventually to fit 
these observations to a model that predicts how the perfor- 
mance scales. The workload size of the program's routines, 
however, is not computed automatically. Algorithmic pro- 
filing by Zaparanuks and Hauswirth [31], besides identify- 
ing boundaries between different algorithms in a program, 
infers their computational cost, which is related to the input
size. The notion of input size is defined at a high level
of abstraction, using different definitions for different data
structures (e.g., the size of an array or the number of nodes
in a tree). Instead, the input-sensitive profiling methodology
described in [5], which provides the basis for our approach,
automatically infers the input size by tracing low-level
memory accesses performed by different routines. None of these
approaches addresses concurrency issues, being thus limited
to sequential computations.

Figure 17. External vs. thread-induced input.
Figure 18. Thread-induced input on a routine basis.

The problem of empirically studying the asymptotic be- 
havior of a program has been the target of extensive research 
in experimental algorithmics ja [iTl llSll . where individual 
portions of algorithmic code are extracted from applications 
and separately analyzed on ad-hoc test harnesses. This ap- 
proach has some drawbacks as a performance evaluation 
method in actual software development, most prominently 
the fact that, by studying performance-critical routines out 
of their context, possible performance effects due to the in- 
teraction with the overall application might be missed. 

A variety of parallelism-related profilers have been pro- 
posed to help programmers parallelize complex codes by 
uncovering the dependencies between different regions of 
the program: examples include pp [14], Alchemist [32], and
Kremlin JTIl. Other profilers for multicore machines, such
as [13, 16, 24], focus on NUMA-related performance issues
and exploit detailed temporal information about memory ac- 
cesses in order to build temporal flows of interactions be- 
tween threads and objects. This is similar to our problem 
of relating memory accesses with thread intercommunica- 
tion, although the final goal is orthogonal to ours, since these 
works aim at understanding the speed improvements that can 
result from parallelizing different portions of code, from ex- 
ecuting a program on a parallel platform, or from diagnosing 
and reducing distant memory accesses. 

8. Conclusions 

In this paper we have extended the input-sensitive profiling
methodology to concurrent computations. Input-sensitive
profiling requires automatically measuring the size of the
input given to a generic code fragment: in a multithreaded
scenario, this raises a variety of interesting issues mainly
due to thread intercommunication via shared memory. To
this end, we have proposed a novel metric, called TRMS, that
gives an estimate of the size of the input of each routine
activation by taking into account first-accesses, possibly induced
by other threads or by kernel system calls. We have shown
that our approach is both methodologically sound and
practical. Namely, our Valgrind-based implementation achieves
performance comparable to other prominent heavyweight
analysis tools. As a future direction, it would be interesting
to adapt our methodology to a fully scalable and concurrent
dynamic instrumentation framework, in order to exploit
parallelism to reduce the slowdown of our profiler.

Figure 19. External input on a routine basis.

Our methodology raises many interesting open issues
regarding input characterization and thread intercommunication
in concurrent applications. Measures derived from
TRMS might make it possible to evaluate concurrency-related
aspects and to discover how multithreaded applications scale their
work and how they communicate via shared memory. E.g.,
in a recent experimental study [12], it has been observed that
even widespread multithreaded benchmarks do not interact
much or interact only in limited ways, and that communication
does not change predictably as a function of the number
of cores. This study exploits a characterization of read/write
memory accesses, and we believe that the TRMS might shed
new light in this direction.